3.2 - Markov models
Discrete states of DNA¶
Nucleic acids (deoxyribonucleic acid [DNA] and its ribonucleic version [RNA]) are composed of units called nucleotides. In DNA, each nucleotide consists of one deoxyribose sugar, one phosphate group, and one of four nitrogenous nucleobases (adenine [A], cytosine [C], guanine [G], or thymine [T]). In RNA, the sugar is ribose and thymine is replaced by uracil (U). These nucleobases fall into two main chemical groups, purines and pyrimidines. In DNA, two chains are bound together by a series of hydrogen bonds between a specific purine and its complementary pyrimidine: adenine pairs with thymine, and guanine with cytosine.
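As a minimal sketch of these pairing rules (the dictionary and the helper name here are our own, not part of this chapter's code), we can encode complementarity directly:

```python
# Watson-Crick pairing: a purine always bonds its complementary pyrimidine
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement_strand(sequence: str) -> str:
    """Return the complementary DNA strand for a sequence."""
    return "".join(COMPLEMENT[base] for base in sequence)

print(complement_strand("ATGC"))  # TACG
```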
Variations on this configuration exist but are rare: we can find non-canonical bases, bonds, or even rearrangements of the whole structure. Regardless, DNA offers a set of discrete characters we can incorporate into models to study it or to predict patterns. For inferring phylogenies, the nucleobases (A, C, G, T) have been one of the primary sources of information when comparing sequences and calculating distances; nevertheless, some models incorporate other information to predict the distribution of bases in a given sequence.
DNA is constantly exposed to phenomena that may change, or mutate, its sequence, either through point mutations (mutations that change one base for another) or through much larger mutations. Mutations can be caused by multiple processes (UV exposure, errors during replication or cell division, insertion or deletion of segments), and even though they are rare, they play an important role in the evolution of organisms. They accumulate over time, and from them we can infer ancestor-descendant relationships.
Considering point mutations, we can illustrate these possible discrete states (nucleotides) and the changes between them using the following simple graph.
import itertools

import toyplot

edges = [i for i in itertools.permutations('ACGT', r=2)]
transitions = {("A", "G"), ("C", "T")}
colors = []
for e in edges:
    sorted_e = tuple(sorted(e))
    if sorted_e in transitions:
        colors.append("blue")
    else:
        colors.append("red")

canvas, axes, graph = toyplot.graph(
    [i[0] for i in edges],
    [i[1] for i in edges],
    ecolor=colors,
    ewidth=5,
    eopacity=0.35,
    width=350,
    height=350,
    margin=0,
    tmarker=">",
    vsize=50,
    vcoordinates=[(-1, 1), (-1, -1), (1, 1), (1, -1)],
    vstyle={"stroke": "black", "stroke-width": 2, "fill": "none"},
    vlstyle={"font-size": "20px"},
    layout=toyplot.layout.FruchtermanReingold(edges=toyplot.layout.CurvedEdges())
)
f_style = {"font-size": "14px"}
canvas.text(175,40, "Transitions", style=f_style)
canvas.text(175,310, "Transitions", style=f_style)
canvas.text(40,175, "Transversions", angle=90, style=f_style)
canvas.text(175,175, "Transversions", style=f_style)
canvas.text(310,175, "Transversions", angle=270, style=f_style);
canvas.style['background-color'] = 'white'
These discrete states and possible changes can be modeled using Markov chains to predict some expected patterns in the sequences.
Markov chains¶
A Markov chain is a system that describes transitions (not to be confused with nucleotide transitions) from one state to another following simple probabilistic rules. In this stochastic model every change is conditioned only on the previous state; this is known as the Markov property. In terms of nucleotides, the probability of observing a given nucleotide depends only on the previous nucleotide.
In this way, Markov chains are "memoryless" processes, in which only the previous state affects the new state, while earlier states are forgotten. This holds strictly for 1st-order Markov chains. As expected, there are higher-order Markov chains that extend the "memory" to a given number of previous states; however, for substitution models, 1st-order chains are the type commonly used.
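The Markov property also works in reverse: a 1st-order chain is fully described by how often each state follows each other state, so we can estimate a transition matrix from an observed sequence by counting consecutive pairs (a sketch; the function name is our own):

```python
import numpy as np

def estimate_transition_matrix(sequence: str, states: str = "ACGT") -> np.ndarray:
    """Estimate 1st-order transition probabilities from observed pairs."""
    counts = np.zeros((len(states), len(states)))
    for prev, curr in zip(sequence, sequence[1:]):
        counts[states.index(prev), states.index(curr)] += 1
    # Normalize each row so it is a probability distribution
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

estimated = estimate_transition_matrix("AACGTTTTCA")
print(estimated[0])  # row for A: A->A once, A->C once, so [0.5, 0.5, 0, 0]
```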
In the following animated graph, we can see what a random walk in a 4-state Markov chain looks like.
One crucial part of these models is the transition probabilities between states. These can be stored and represented as a matrix (symmetric or asymmetric). In this matrix, every row sums to 1.
We can declare this matrix in Python using a NumPy array as follows:
import numpy as np

#                              A     C     G     T
transition_matrix = np.array([[0.50, 0.20, 0.10, 0.20],  # A
                              [0.10, 0.50, 0.30, 0.10],  # C
                              [0.05, 0.20, 0.70, 0.05],  # G
                              [0.00, 0.05, 0.05, 0.90]]) # T
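A quick way to check that a matrix like this is a valid transition matrix is to verify that every row sums to 1:

```python
import numpy as np

#                              A     C     G     T
transition_matrix = np.array([[0.50, 0.20, 0.10, 0.20],  # A
                              [0.10, 0.50, 0.30, 0.10],  # C
                              [0.05, 0.20, 0.70, 0.05],  # G
                              [0.00, 0.05, 0.05, 0.90]]) # T

# Every row is a probability distribution over the next state
assert np.allclose(transition_matrix.sum(axis=1), 1.0)
```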
Using the transition probabilities in this matrix, we can do a random walk, changing from one state to another for a given number of steps. The next function implements a simple way to do this, storing each state visited in the Markov chain.
def print_random_walk(steps: int,
                      initial_state: str,
                      transition_matrix: np.ndarray,
                      states: str = "ACGT") -> str:
    """Do a random walk and return the sequence of all states visited

    Parameters
    ----------
    steps
        Number of steps in the random walk
    initial_state
        Where to start the random walk
    transition_matrix
        Transition matrix with transition probabilities for all states
    states
        States to consider in the Markov chain

    Returns
    -------
    Sequence of all states visited
    """
    current_state = initial_state
    result_sequence = [current_state]
    step = 1
    while step < steps:
        previous_state = current_state
        previous_idx = states.index(previous_state)
        # Choose the next state at random using the probabilities
        # in the matrix row of the previous state
        current_state = np.random.choice(list(states), p=transition_matrix[previous_idx])
        result_sequence.append(current_state)
        step += 1
    return "".join(result_sequence)
With this random walk function, we can generate a random sequence that follows the transition probabilities we set in our transition matrix:
sequence = print_random_walk(states="ACGT",
                             steps=200,
                             initial_state="G",
                             transition_matrix=transition_matrix)
sequence
'GGGGCGCCCTTTCGGGGAATTTTTTTTTTTTCCGGCAAAGGGGCCTTTTTTCTTTTTTTTTTTCCTTTTTTTTTTCCGGCCAAATTTTTTCCCCCGTTTTTCAATTTTTTTTTTTTTTTTTTTTCCCGGGCCAACCGGGAAACCCATTTTCGGGGCCCCCCCCCGTTTTTGGCCCCCGGGGGGGGCGTTTTTTTTTTTTT'
We can notice that this random sequence shows some patterns: for example, more T's than A's, a similar number of C's and G's, and islands of certain bases. The transition probabilities we set influence these patterns markedly, and at some point the distribution of the states (in this case nucleotides) becomes predictable. If the sequence of changes is long enough, we can predict how many T's, for example, we will have in a random sequence that follows a particular model.
That predictability is represented by the stationary distribution or equilibrium in a Markov model.
Stationary distributions¶
Stationary distributions, also known as the equilibrium, are probability distributions that remain unchanged as time progresses. In other words, the distribution of every state in a sequence of changes tends to converge over time, and it remains the same after some point.
These stationary distributions are vectors with the distribution of all states in the model and are commonly denoted with π. They constitute an essential part of implementing models for calculating expected distributions in DNA sequences.
We can see this when we do several random walks and start noticing that the number of T's, for example, is very similar from one sequence to another. Let us generate 10 sequences of 1000 bases each by doing different random walks and counting the number of T's in each of them.
for i in range(10):
    sequence = print_random_walk(states="ACGT",
                                 steps=1000,
                                 initial_state="G",
                                 transition_matrix=transition_matrix)
    n_ts = sequence.count("T")
    print(f"Seq. {i}: {sequence[0:10]}...{sequence[-10:-1]} - T's = {n_ts}")
Seq. 0: GGCGGGGGGG...GTTTTTTTT - T's = 430
Seq. 1: GTTTGGGGGG...TTTTTTTTT - T's = 461
Seq. 2: GGGCCGGGGG...GGCGGAAAA - T's = 423
Seq. 3: GAGGGGGGGG...TTTTTGGCC - T's = 389
Seq. 4: GGCCCCGTTT...CGCTTTTTT - T's = 535
Seq. 5: GCAAAAAACG...TTTTTTTTT - T's = 453
Seq. 6: GGGGGGGCCG...CCAAGGGAA - T's = 484
Seq. 7: GGGCCCCGGG...AAAACCCCC - T's = 437
Seq. 8: GGGGCCAAAA...TTTTTGGGG - T's = 411
Seq. 9: GGTTTTTTTT...CGCCCGGGG - T's = 549
We can see that most of them have around 450 T's (about 45% of the entire sequence). If the sequence is long enough, we will end up with the stationary distribution, and the number of T's can be predicted.
Let us illustrate this convergence with a plot, running a random walk for 10000 steps and tracking the frequency of each nucleotide.
import toyplot

def plot_convergence(states, steps, initial_state, transition_matrix):
    current_state = initial_state
    sequence = [current_state]
    empiric_frequencies = np.zeros((int(steps), len(states)))
    step = 1
    while step < steps:
        previous_state = current_state
        previous_idx = states.index(previous_state)
        current_state = np.random.choice(list(states), p=transition_matrix[previous_idx])
        sequence.append(current_state)
        step_freq = []
        for base in states:
            step_freq.append(sequence.count(base) / len(sequence))
        empiric_frequencies[step] = step_freq
        step += 1
    label_style = {"text-anchor": "start", "-toyplot-anchor-shift": "5px"}
    canvas, axes, mark = toyplot.plot(empiric_frequencies, ylabel="Frequency", xlabel="Steps")
    canvas.style['background-color'] = 'white'
    for i in states:
        axes.text(steps, empiric_frequencies[-1, states.index(i)], i, style=label_style)
    return canvas, axes, mark

plot_convergence(states="ACGT", initial_state="A", steps=10000, transition_matrix=transition_matrix);
We can find these stationary distributions using different approaches. One of them is a Monte Carlo approach, in which we do a random walk long enough to get the distribution from the actual sequence of events. This method is easy to understand, but it is not very efficient.
def get_stationary_dist_monte_carlo(states, steps, initial_state, transition_matrix):
    # Create an empty array to accumulate the counts of each state
    frequencies = np.zeros(len(states))
    # do the first step
    initial_idx = states.index(initial_state)
    frequencies[initial_idx] = 1
    previous_idx = initial_idx
    # do the other steps
    step = 1
    while step < steps:
        current_state = np.random.choice(list(states), p=transition_matrix[previous_idx])
        current_idx = states.index(current_state)
        frequencies[current_idx] += 1
        previous_idx = current_idx
        step += 1
    stationary_distribution = frequencies / steps  # π
    return stationary_distribution
Using this method we can get the stationary distribution of our Markov chain:
get_stationary_dist_monte_carlo(states="ACGT",
                                steps=1e6,
                                initial_state="A",
                                transition_matrix=transition_matrix)
array([0.066472, 0.186481, 0.286193, 0.460854])
With this, we can see that the frequency of T's will be about 0.46 in any sequence produced by a random walk in this Markov chain.
As mentioned before, there are other methods to find this distribution, some more efficient than the previous one. The following applies a repeated matrix multiplication method to the transition matrix. Notice that it is still an iterative operation, but it needs far fewer iterations to reach the stationary distribution.
def get_stationary_dist_matrix_multiplication(states, iterations, transition_matrix):
    """Repeated matrix multiplication method"""
    current_matrix = transition_matrix
    iteration = 1
    while iteration < iterations:
        # Calculate the matrix product of the two arrays
        current_matrix = np.matmul(current_matrix, transition_matrix)
        iteration += 1
    stationary_distribution = current_matrix[0]
    return stationary_distribution
get_stationary_dist_matrix_multiplication("ACGT", iterations=100, transition_matrix=transition_matrix)
array([0.06593407, 0.18681319, 0.28571429, 0.46153846])
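A third approach (not used in this chapter's code; shown here as a sketch) solves for the stationary distribution directly: π satisfies πP = π, so it is the left eigenvector of the transition matrix associated with eigenvalue 1.

```python
import numpy as np

#                              A     C     G     T
transition_matrix = np.array([[0.50, 0.20, 0.10, 0.20],  # A
                              [0.10, 0.50, 0.30, 0.10],  # C
                              [0.05, 0.20, 0.70, 0.05],  # G
                              [0.00, 0.05, 0.05, 0.90]]) # T

# np.linalg.eig returns right eigenvectors, so decompose the transpose of P
eigenvalues, eigenvectors = np.linalg.eig(transition_matrix.T)
# pick the eigenvector whose eigenvalue is (numerically) 1
idx = np.argmin(np.abs(eigenvalues - 1.0))
pi = np.real(eigenvectors[:, idx])
pi = pi / pi.sum()  # normalize to a probability distribution
print(pi)  # approximately [0.0659, 0.1868, 0.2857, 0.4615]
```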
Rate matrices¶
Here we can introduce another concept: the rate matrix (commonly called the Q matrix), in which each element describes the rate (the expected number of events per unit of time) at which one state changes into another. Unlike a transition matrix, in a rate matrix every row sums to 0.
# Example of a rate matrix
#                        A      C      G      T
rate_matrix = np.array([[-1.00, 0.333, 0.333, 0.333],  # A
                        [0.333, -1.00, 0.333, 0.333],  # C
                        [0.333, 0.333, -1.00, 0.333],  # G
                        [0.333, 0.333, 0.333, -1.00]]) # T
The library SciPy includes a linear algebra module with which we can use matrix exponentiation to get the probabilities of change after a given time. Take into account that if the time is large enough, you will end up with the stationary distribution (this is another way to get this vector, besides the Monte Carlo and repeated matrix multiplication approaches).
Because this method works with real numbers (the previous approaches only worked with integers [steps or iterations]), we can calculate the probability of transition given a branch length.
from scipy.linalg import expm
time = 0.2
expm(rate_matrix * time)
array([[0.82443456, 0.05845515, 0.05845515, 0.05845515],
[0.05845515, 0.82443456, 0.05845515, 0.05845515],
[0.05845515, 0.05845515, 0.82443456, 0.05845515],
[0.05845515, 0.05845515, 0.05845515, 0.82443456]])
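As noted above, letting time grow in the matrix exponential is yet another way to reach the stationary distribution. A quick sketch (using exact thirds so every row sums to exactly 0; for this symmetric Q the stationary distribution is uniform):

```python
import numpy as np
from scipy.linalg import expm

# Same rate matrix as above, but with exact thirds so rows sum to 0
rate_matrix_exact = (1/3) * np.ones((4, 4)) - (4/3) * np.eye(4)

# For a large enough time, every row of expm(Q * t) converges
# to the stationary distribution (uniform here, by symmetry)
long_run = expm(rate_matrix_exact * 100)
print(long_run[0])  # approximately [0.25, 0.25, 0.25, 0.25]
```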
A Q matrix cannot be passed to a Markov chain because some of its entries are negative; it is not intended to be the transition matrix of a Markov process. The transition matrix, instead, can be used in a Markov chain without issue. Let us explore this with an example simulating the evolution of two sequences.
Simulation of evolution under a Markov model¶
In the previous examples we were generating independent sequences using a Markov process. However, to translate this concept into sequence evolution, we need to follow the Markov model across generations of the same sequence (time).
import numpy as np

STATES = [0, 1, 2, 3]
BASES = "ACGT"

class lineage:
    """Create a lineage object that can evolve a given number of generations
    and split into two child lineages.
    """
    def __init__(self, transition_matrix, length=1000, tips=4, parent=None, states=STATES):
        if not parent:
            self._sequence = np.random.choice(states, size=length)
        else:
            self._sequence = parent._sequence.copy()
        self.length = length
        self.transition_matrix = transition_matrix
        self.parent = parent
        self.generations = 0
        self.right = None
        self.left = None
        self.real_changes = np.zeros(length)
        self.sequence = self._translate_sequence()

    def evolve(self, generations=1):
        for gen in range(int(generations)):
            for index, base in enumerate(self._sequence):
                new_base = np.random.choice(STATES, p=self.transition_matrix[base])
                if new_base != base:
                    self._sequence[index] = new_base
                    self.real_changes[index] += 1
            self.generations += 1
        self.sequence = self._translate_sequence()

    def cladogenesis(self):
        self.right = lineage(self.transition_matrix, length=self.length, parent=self)
        self.left = lineage(self.transition_matrix, length=self.length, parent=self)
        return self.left, self.right

    def _translate_sequence(self):
        return "".join([BASES[b] for b in self._sequence])
# As mentioned above, there are two different matrices to consider in substitution
# models. One is the Q matrix, or instantaneous rate matrix, which gives
# the probability of a change in a given time.
# For Jukes-Cantor, the rate matrix (Q) is:
substitutions_site_year = 1e-2
alpha = substitutions_site_year / 3

#                  A          C          G          T
Q_jc = np.array([[-3*alpha,   alpha,     alpha,     alpha  ],  # A
                 [ alpha,    -3*alpha,   alpha,     alpha  ],  # C
                 [ alpha,     alpha,    -3*alpha,   alpha  ],  # G
                 [ alpha,     alpha,     alpha,    -3*alpha]]) # T
Q_jc
array([[-0.01 , 0.00333333, 0.00333333, 0.00333333],
[ 0.00333333, -0.01 , 0.00333333, 0.00333333],
[ 0.00333333, 0.00333333, -0.01 , 0.00333333],
[ 0.00333333, 0.00333333, 0.00333333, -0.01 ]])
# Given a time, we can use the Q matrix to get the probability of X changing to Z.
# For example, with substitutions_site_year = 1e-2, over 5 units of time (years here)
# the probability that a base remains in the same state (A>A>A>A>A) is 0.95.
# This can be done using the matrix exponential.
from scipy.linalg import expm
time = 5
expm(Q_jc * time)
array([[0.95163024, 0.01612325, 0.01612325, 0.01612325],
[0.01612325, 0.95163024, 0.01612325, 0.01612325],
[0.01612325, 0.01612325, 0.95163024, 0.01612325],
[0.01612325, 0.01612325, 0.01612325, 0.95163024]])
# Using this property, we can calculate the transition probability matrix
# for this Q matrix over one unit of time using matrix exponentiation.
expm(Q_jc)
array([[0.99006637, 0.00331121, 0.00331121, 0.00331121],
[0.00331121, 0.99006637, 0.00331121, 0.00331121],
[0.00331121, 0.00331121, 0.99006637, 0.00331121],
[0.00331121, 0.00331121, 0.00331121, 0.99006637]])
# For simple models like Jukes-Cantor, the transition matrix is also often written
# in the following way (this is a first-order approximation):
substitutions_site_year = 1e-4
alpha = substitutions_site_year / 3

#                                 A          C          G          T
transition_matrix_jc = np.array([[1-3*alpha, alpha,     alpha,     alpha    ],  # A
                                 [alpha,     1-3*alpha, alpha,     alpha    ],  # C
                                 [alpha,     alpha,     1-3*alpha, alpha    ],  # G
                                 [alpha,     alpha,     alpha,     1-3*alpha]]) # T
transition_matrix_jc
# You can see the similarity of this defined matrix to the result of the
# matrix exponentiation of the Q matrix (note the smaller rate, 1e-4, used here).
array([[9.99900000e-01, 3.33333333e-05, 3.33333333e-05, 3.33333333e-05],
[3.33333333e-05, 9.99900000e-01, 3.33333333e-05, 3.33333333e-05],
[3.33333333e-05, 3.33333333e-05, 9.99900000e-01, 3.33333333e-05],
[3.33333333e-05, 3.33333333e-05, 3.33333333e-05, 9.99900000e-01]])
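Note that the two matrices above were built with different rates (1e-2 for Q_jc versus 1e-4 here), which is why the printed values differ. As a sketch, rebuilding both at the same rate shows that the linear form P ≈ I + Q is just the first-order approximation of expm(Q), so for small rates they nearly coincide:

```python
import numpy as np
from scipy.linalg import expm

rate = 1e-4
alpha = rate / 3
# JC rate matrix: alpha off the diagonal, -3*alpha on it
Q = alpha * np.ones((4, 4)) - 4 * alpha * np.eye(4)

P_exact = expm(Q)         # exact transition probabilities for one time unit
P_approx = np.eye(4) + Q  # first-order (linear) approximation

print(np.abs(P_exact - P_approx).max() < 1e-7)  # True
```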
# To apply a Markov chain in a simulated evolution under a
# Jukes-Cantor model, we need to use the transition matrix (not the Q matrix)
# to drive the Markov chain.
# Create an original lineage
ol = lineage(transition_matrix_jc, length=100)
# ol.sequence
# evolve it a little before splitting it
ol.evolve(1000)
# ol.sequence
# split it into two child lineages
ch1, ch2 = ol.cladogenesis()
# ch1.sequence, ch2.sequence
# evolve the child lineages
ch1.evolve(1000)
ch2.evolve(1000)
ch1.sequence, ch2.sequence
('GATATTTTGACATTCCTAAATAATGGATTTTACCCTTTGAGTTCGTCGCTACCTAAGCCTGCGGGGATCGCTTGAGCAGTTGAAGCTAGCTCTTCACGAT',
'TATAATTTGACATACCTTAATAATGGGTTTCACTCTTTCACTTCGTCACTCTCTTAGCCTGCGGCCATCGCATGTGCAGTTGATGCTAGCTCCCCACGAT')
# check the real changes from the ancestral state to the current state
ch1.real_changes, ch2.real_changes
(array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 2., 0., 0., 0.,
1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 1.,
0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 1., 1.,
0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]),
array([1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 2.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0.,
0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 2., 1., 0., 0., 0., 0., 0., 0.]))
The real changes in these sequences were stored in our lineage objects, so we know exactly how different they are:
(ch1.real_changes.sum() + ch2.real_changes.sum()) / ch1.length
0.25
The p-distance is just the number of differing sites over the total number of sites in the compared sequences:
p_distance = np.not_equal(ch1._sequence, ch2._sequence).sum() / ch1.length
p_distance
0.19
Let us correct that number using the JC model:
corrected_distance_jc = -3/4 * np.log(1 - 4/3 * p_distance)
corrected_distance_jc
0.21910231710087086
In this small-scale example the result is not very impressive, but note that the p-distance alone notoriously underestimates the real differences between two sequences.
Let us simulate a longer sequence (1000 bp) and run it for a longer time, tracking in each generation (say, years) the real changes, the p-distance, and the corrected distance.
Tracking distances along time¶
generations = 10000
results = np.zeros((generations, 3))
ol = lineage(transition_matrix_jc, length=1000)
ol.evolve(100)
ch1, ch2 = ol.cladogenesis()
for g in range(generations):
    ch1.evolve()
    ch2.evolve()
    p_distance = np.not_equal(ch1._sequence, ch2._sequence).sum() / ch1.length
    real_distance = (ch1.real_changes.sum() + ch2.real_changes.sum()) / ch1.length
    # if the p-distance is 3/4 or greater, the JC correction is undefined
    if p_distance < 3/4:
        # correct the distance using the JC model
        corrected_distance_jc = -3/4 * np.log(1 - 4/3 * p_distance)
    else:
        corrected_distance_jc = np.inf
    results[g] = [corrected_distance_jc, p_distance, real_distance]
import toyplot
canvas, axes, mark = toyplot.plot(results, ylabel="Distance", xlabel="Generations");
axes.text(generations, results[-1][0], "With model", style={"text-anchor":"start", "-toyplot-anchor-shift":"5px"});
axes.text(generations, results[-1][1], "Without model", style={"text-anchor":"start", "-toyplot-anchor-shift":"5px"});
axes.text(generations, results[-1][2], "Real distance", style={"text-anchor":"start", "-toyplot-anchor-shift":"5px"});
# Now let's evolve a sequence under K2P but correct it with JC, to see how wrong that is.
# If we fix the total rate to 1 and let R = alpha / (2*beta), the rates would be:
R = 0.5  # ratio of transitions/transversions
alpha = R / (R + 1)
beta = (1/2) * 1 / (R + 1)
# Here we instead set the rates directly:
alpha = 1e-2  # substitution rate for transitions (substitutions/site/year, transitions only)
beta = 1e-4   # substitution rate for transversions (substitutions/site/year, transversions only)
R = alpha / beta
R
100.0
# In K2P the Q matrix is:
# Let the substitution rates be α for transitions and β for transversions
#                   A                C                G                T
Q_k2p = np.array([[-(alpha+2*beta),  beta,            alpha,           beta          ],  # A
                  [ beta,           -(alpha+2*beta),  beta,            alpha         ],  # C
                  [ alpha,           beta,           -(alpha+2*beta),  beta          ],  # G
                  [ beta,            alpha,           beta,           -(alpha+2*beta)]]) # T
Q_k2p
array([[-0.0102,  0.0001,  0.01  ,  0.0001],
       [ 0.0001, -0.0102,  0.0001,  0.01  ],
       [ 0.01  ,  0.0001, -0.0102,  0.0001],
       [ 0.0001,  0.01  ,  0.0001, -0.0102]])
expm(Q_k2p)
array([[9.89901347e-01, 9.99800027e-05, 9.89869341e-03, 9.99800027e-05],
[9.99800027e-05, 9.89901347e-01, 9.99800027e-05, 9.89869341e-03],
[9.89869341e-03, 9.99800027e-05, 9.89901347e-01, 9.99800027e-05],
[9.99800027e-05, 9.89869341e-03, 9.99800027e-05, 9.89901347e-01]])
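Since these sequences actually evolve under K2P, the matching correction is the Kimura two-parameter distance, d = -1/2 ln[(1 - 2P - Q) * sqrt(1 - 2Q)], where P and Q are the observed proportions of transition and transversion differences per site. This correction is not used in this chapter's code; here is a sketch:

```python
import numpy as np

def k2p_distance(p_transitions: float, q_transversions: float) -> float:
    """Kimura two-parameter distance from the observed proportions of
    transition (P) and transversion (Q) differences per site."""
    return -0.5 * np.log((1 - 2 * p_transitions - q_transversions)
                         * np.sqrt(1 - 2 * q_transversions))

# Example: 10% transition and 5% transversion differences
print(round(k2p_distance(0.10, 0.05), 4))  # 0.1702
```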
# In K2P the P matrix (transition probabilities) is also given in closed form by
# three elements (P0, P1, and P2); here we test expm(Q_k2p) instead to get
# the transition probability matrix.
generations = 10000
results = np.zeros((generations, 3))
ol = lineage(expm(Q_k2p), length=1000)
ol.evolve(100)
ch1, ch2 = ol.cladogenesis()
for g in range(generations):
    ch1.evolve()
    ch2.evolve()
    p_distance = np.not_equal(ch1._sequence, ch2._sequence).sum() / ch1.length
    real_distance = (ch1.real_changes.sum() + ch2.real_changes.sum()) / ch1.length
    # if the p-distance is 3/4 or greater, the JC correction is undefined
    if p_distance < 3/4:
        # correct the distance using the JC model (the wrong model here)
        corrected_distance_jc = -3/4 * np.log(1 - 4/3 * p_distance)
    else:
        corrected_distance_jc = np.inf
    results[g] = [corrected_distance_jc, p_distance, real_distance]
import toyplot
canvas, axes, mark = toyplot.plot(results, ylabel="Distance", xlabel="Generations");
axes.text(generations, results[-1][0], "With model", style={"text-anchor":"start", "-toyplot-anchor-shift":"5px"});
axes.text(generations, results[-1][1], "Without model", style={"text-anchor":"start", "-toyplot-anchor-shift":"5px"});
axes.text(generations, results[-1][2], "Real distance", style={"text-anchor":"start", "-toyplot-anchor-shift":"5px"});
# 11.17: expected proportion of different sites after time t under JC,
# with per-site substitution rate u
import math
u = substitutions_site_year  # per-site substitution rate (value defined above)
t = ch1.generations
ds = 3/4 * (1 - math.e ** (-4/3 * u * t))
ds
# 11.18: inverting 11.17 recovers the corrected distance
corrected_distance_jc = -3/4 * np.log(1 - 4/3 * ds)
corrected_distance_jc